Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LightningDataModule for webdataset #100

Merged
merged 70 commits into from
Aug 30, 2024
Merged

Conversation

DejunL
Copy link
Collaborator

@DejunL DejunL commented Aug 14, 2024

This PR achieves the following:

  1. implement a LightningDataModule of webdataset called WebDataModule. It takes a set of webdataset tar files and various webdataset config settings as input and setups the WebDataset and WebLoader to be used later by the Lightning.Trainer workflows
  2. implement a LightningDataModule PickledDataWDS that inherits from the aforementioned WebDataModule that allows the user to experiment with different train/val/test splits of the input pickled data to be used in creating the webdataset tar files
  3. Tests, docstring and README for 1 and 2
    4. Migrate a subset of BioNeMo1 DiffDock data processing routines for the purpose of testing the data modules in 1 and 2 in the context of DiffDock model
    5. Tests for 4 (these are postponed for the next diffdock PR)

@DejunL DejunL requested a review from guoqing-zhou August 14, 2024 23:52
@DejunL DejunL self-assigned this Aug 15, 2024
@DejunL DejunL requested a review from kdidiNVIDIA August 15, 2024 17:11
@malcolmgreaves malcolmgreaves marked this pull request as draft August 15, 2024 20:25
Copy link
Collaborator

@malcolmgreaves malcolmgreaves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DejunL thanks for the contribution, but this PR is too large to review. It needs to be broken up into more readily reviewable chunks. If you want more context, please refer to the contributing and PR review guides in the repository. Please aim for 500-1000 lines at the maximum to make this tractable for reviewers. Thanks!

After a first pass, my suggestion is to do this by:

  • having a PR for the webdataset
  • having another PR for diffdock utils

Webdataset looks like it is diffdock specific, so please put this into bionemo-diffdock and hold off on putting it into bionemo-core until we have confirmed multiple use cases.

Additionally, the diffdock utils need unit tests before they can be merged in.

@DejunL
Copy link
Collaborator Author

DejunL commented Aug 15, 2024

@DejunL thanks for the contribution, but this PR is too large to review. It needs to be broken up into more readily reviewable chunks. If you want more context, please refer to the contributing and PR review guides in the repository. Please aim for 500-1000 lines at the maximum to make this tractable for reviewers. Thanks!

After a first pass, my suggestion is to do this by:

* having a PR for the webdataset

* having another PR for diffdock utils

Webdataset looks like it is diffdock specific, so please put this into bionemo-diffdock and hold off on putting it into bionemo-core until we have confirmed multiple use cases.

Additionally, the diffdock utils need unit tests before they can be merged in.

Thanks for the suggestion. I should clarify the motivation in the OP but let me do it here too. The webdatamodule is not diffdock specific. Nothing in the datamodule.py depends on any specific model. DiffDock-pp, which we are planning to add to BioNeMo soon, is also using webdataset. @kdidiNVIDIA is also interested in trying to use webdataset in the protein foundation model. So putting webdatamodule into bionemo-diffdock will not achieve the desirable modularization.

My original thinking of putting webdatamodule into bionemo-core was seeing the PRNGDatasetShuffler in it. The webdataset data processing routines share the same or even higher level of generalizability (considering PRNGDatasetShuffler only works for indexable dataset but not iterable dataset while webdataset can handle both). So I think bionemo-core would be a good place to start with.

In the context of @guoqing-zhou working on modularizing DiffDock specific code, I can remove the the diffdock tests and utils from this PR and only focus on the generic webdatamodule. This would also reduce the number of lines to review. What do you think? @malcolmgreaves

@malcolmgreaves
Copy link
Collaborator

#100 (comment)

Following up from the slack discussion - let's double check alignment and summarize next steps:
(1) Make a new sub-package for the webdataset dataloader. Make a PR with this.
(2) Make a 2nd PR for Diffdock. It will depend on this webdataset dataloader subpackage.

How does this sound @DejunL ?

@DejunL DejunL force-pushed the dejunl/diffdock-datamodule branch 3 times, most recently from a114765 to 082b681 Compare August 20, 2024 00:09
@DejunL
Copy link
Collaborator Author

DejunL commented Aug 20, 2024

Following up from the slack discussion - let's double check alignment and summarize next steps:
(1) Make a new sub-package for the webdataset dataloader. Make a PR with this.
(2) Make a 2nd PR for Diffdock. It will depend on this webdataset dataloader subpackage.

First of all, we should stop referring this PR to "dataset" or "dataloader" as this is about lightning data module for using webdataset.

I have just moved the code under bionemo-webdatamodule and remove the diffdock specific code because the webdatamodule doesn't depend on diffdock at all (the original code you were referring to was only used for testing webdatamodule). The diffdock migration to bionemo2 will be future work. @malcolmgreaves

@DejunL DejunL force-pushed the dejunl/diffdock-datamodule branch from 22d41d3 to dd054a8 Compare August 20, 2024 00:37
@DejunL DejunL marked this pull request as ready for review August 20, 2024 00:45
@DejunL DejunL changed the title Draft: LightningDataModule for webdataset LightningDataModule for webdataset Aug 20, 2024
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of our sub-packages currently have a duplicated LICENSE file. Is this not needed?

If we clean this up, will the LICENSE still be logged properly in the package metadata? For example will this proposed test still pass? #82

@DejunL DejunL force-pushed the dejunl/diffdock-datamodule branch from c851688 to a35765f Compare August 29, 2024 23:54
@DejunL
Copy link
Collaborator Author

DejunL commented Aug 29, 2024

/build-ci

@DejunL DejunL merged commit bf42f18 into v2-main Aug 30, 2024
3 checks passed
@DejunL DejunL deleted the dejunl/diffdock-datamodule branch August 30, 2024 17:01
@DejunL DejunL restored the dejunl/diffdock-datamodule branch September 3, 2024 17:21
@DejunL DejunL deleted the dejunl/diffdock-datamodule branch September 3, 2024 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants